We analyze the WT-5 dataset from 10x Genomics. Reads were mapped to the GRCh38 transcriptome with Cell Ranger version 2.1.0 (STAR). Summary statistics for the run: estimated number of cells, 2,465; mean reads per cell, 79,415; median genes per cell, 4,136; number of reads, 195,759,969; reads mapped to genome, 84.8%; reads mapped confidently to genome, 81.9%. We start by reading in the data. All features in Seurat have been configured to work with sparse matrices, which results in significant memory and speed savings for 10x data.
The following violin plots display the number of genes, the number of UMIs, and the percentage of mitochondrial genes across all cells. The GenePlots visualize gene-gene relationships. Cells are filtered on user-defined criteria, such as the number of genes expressed in a cell or the percentage of mitochondrial genes present. Here we filter out cells that have unique gene counts over 8,000 or under 2,000, or that have more than 10% mitochondrial genes.
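The filtering rule can be sketched in a few lines of plain Python. This is an illustration of the thresholds stated above, not Seurat's own code: the record layout, the field names `n_genes` and `pct_mito`, and the toy cells are hypothetical, and treating the boundary values as passing is an assumption.

```python
# Hypothetical per-cell QC records; field names and values are illustrative only.
cells = [
    {"id": "cell_a", "n_genes": 4136, "pct_mito": 3.2},
    {"id": "cell_b", "n_genes": 1500, "pct_mito": 2.0},   # too few genes
    {"id": "cell_c", "n_genes": 9200, "pct_mito": 4.1},   # too many genes
    {"id": "cell_d", "n_genes": 5000, "pct_mito": 14.5},  # too much mitochondrial signal
]


def passes_qc(cell, min_genes=2000, max_genes=8000, max_pct_mito=10.0):
    """Return True when the cell lies inside all three thresholds.

    Boundary values are treated as passing, which is an assumption.
    """
    return (min_genes <= cell["n_genes"] <= max_genes
            and cell["pct_mito"] <= max_pct_mito)


kept = [c["id"] for c in cells if passes_qc(c)]
print(kept)
```

Only the first toy cell survives; each of the others violates exactly one threshold.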
After removing unwanted cells from the dataset, the next step is to normalize the data. By default, Seurat applies a global-scaling normalization method “LogNormalize” that normalizes the gene expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result.
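As a minimal sketch of this normalization for a single cell: divide each gene's count by the cell's total, multiply by the scale factor, and log-transform. The function name and the toy counts are hypothetical, and the log transform is assumed to be the natural log of one plus the scaled value so that zero counts stay at zero.

```python
import math


def log_normalize(counts, scale_factor=10_000):
    """Normalize one cell's counts by its total, scale, and log-transform.

    Assumes the transform is log(1 + x) with the natural log.
    """
    total = sum(counts)
    if total == 0:
        # A cell with no counts has nothing to normalize.
        return [0.0 for _ in counts]
    return [math.log1p(c / total * scale_factor) for c in counts]


cell = [0, 5, 15, 80]  # hypothetical UMI counts for four genes in one cell
normalized = log_normalize(cell)
print(normalized)
```

Because every cell is rescaled to the same total before the log, cells sequenced to different depths become comparable.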
Seurat identifies highly variable genes and focuses on them for downstream analysis. It calculates the average expression and dispersion for each gene, places the genes into bins by average expression, and calculates a z-score for dispersion within each bin. This helps control for the relationship between variability and average expression. The parameters here identify ~9,000 variable genes.
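One way to control for the relationship between variability and average expression is to z-score each gene's dispersion only against genes with a similar mean. The sketch below is a simplified illustration under stated assumptions, not Seurat's implementation: dispersion is taken as the plain variance-to-mean ratio, the bins are equal-width, and the function name, bin count, and toy data are hypothetical.

```python
import statistics


def dispersion_zscores(expr_by_gene, n_bins=2):
    """Map gene -> z-score of its dispersion within its mean-expression bin.

    expr_by_gene maps a gene name to its expression values across cells.
    """
    stats = {}
    for gene, values in expr_by_gene.items():
        mean = statistics.fmean(values)
        var = statistics.pvariance(values)
        stats[gene] = (mean, var / mean if mean > 0 else 0.0)

    # Equal-width bins over the range of mean expression (an assumption).
    means = [m for m, _ in stats.values()]
    lo, hi = min(means), max(means)
    width = (hi - lo) / n_bins or 1.0

    bins = {}
    for gene, (mean, disp) in stats.items():
        idx = min(int((mean - lo) / width), n_bins - 1)
        bins.setdefault(idx, []).append((gene, disp))

    # Standardize dispersion within each bin.
    z = {}
    for members in bins.values():
        disps = [d for _, d in members]
        mu = statistics.fmean(disps)
        sd = statistics.pstdev(disps)
        for gene, d in members:
            z[gene] = (d - mu) / sd if sd > 0 else 0.0
    return z


# Hypothetical toy data: two lowly and two highly expressed genes.
toy = {
    "low_flat": [1, 1, 1, 1],
    "low_var": [0, 2, 0, 2],
    "high_flat": [10, 10, 10, 10],
    "high_var": [5, 15, 5, 15],
}
z = dispersion_zscores(toy)
selected = [gene for gene, score in z.items() if score > 0]
print(selected)
```

The variable gene in each bin is selected even though the highly expressed one has a much larger raw dispersion, which is the point of standardizing within bins.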
Single-cell data likely contain unwanted sources of variation, such as technical noise, batch effects, or even biological sources of variation. Regressing these out can improve downstream dimensionality reduction and clustering. Seurat constructs linear models to predict gene expression from user-defined variables. It can regress out cell-cell variation in gene expression driven by batch (if applicable), cell alignment rate (as provided by Drop-seq tools for Drop-seq data), the number of detected molecules, and mitochondrial gene expression. It can also learn a cell-cycle score for cycling cells.
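The idea of regressing out a variable can be illustrated for one gene and one covariate: fit a straight line predicting expression from the covariate and keep the residuals. This is a deliberately reduced sketch, assuming a single covariate and ordinary least squares; the function name and the toy values are hypothetical, and a real analysis would fit several variables at once.

```python
def regress_out(expression, covariate):
    """Return residuals of a least-squares line fit of expression on covariate."""
    n = len(expression)
    mean_x = sum(covariate) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    if sxx == 0:
        # A constant covariate explains nothing; only centre the values.
        return [y - mean_y for y in expression]
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(covariate, expression))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x)
            for x, y in zip(covariate, expression)]


# Hypothetical values: molecules detected per cell and one gene's expression.
n_umi = [1000, 2000, 3000, 4000]
gene = [1.0, 2.1, 2.9, 4.0]
residuals = regress_out(gene, n_umi)
print(residuals)
```

The residuals sum to zero and are uncorrelated with the covariate, so whatever structure remains in them is not explained by the number of detected molecules.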
Next we perform PCA on the scaled data. By default, the variable genes are used as input. We have typically found that running dimensionality reduction on highly variable genes can improve performance. This step helps decide which PCs to carry into further dimension reduction and analysis. The plot shows cells on the first and second PCs.
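To make concrete what a principal component is, the sketch below finds the first PC of a tiny matrix by power iteration on the centred data. Power iteration is swapped in here only because it fits in a few lines of plain Python; it is not presented as the method Seurat uses. The function name, the starting vector, and the toy matrix are hypothetical.

```python
import math


def first_pc(rows, n_iter=200):
    """Return (loadings, scores) of the first principal component.

    rows holds one list of gene values per cell.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]

    # Non-uniform start vector; power iteration would stall if the start
    # were exactly orthogonal to the leading component.
    start = [float(j + 1) for j in range(p)]
    norm = math.sqrt(sum(x * x for x in start))
    v = [x / norm for x in start]

    for _ in range(n_iter):
        # Multiply by X^T X without forming it: scores = X v, w = X^T scores.
        scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
        w = [sum(centered[i][j] * scores[i] for i in range(n))
             for j in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]

    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores


# Hypothetical scaled data: genes 1 and 2 vary together, gene 3 is noise.
cells = [
    [-2.0, -2.0, 0.1],
    [-1.0, -1.0, -0.1],
    [1.0, 1.0, 0.1],
    [2.0, 2.0, -0.1],
]
loadings, scores = first_pc(cells)
print(loadings, scores)
```

The two co-varying genes receive nearly equal loadings while the noise gene's loading stays near zero, and each cell's score is its position along that shared axis.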
The output below lists the candidate PCs and the genes that drive each one. For each PC, the two gene lists correspond to the opposite ends of that PC's loadings.
## [1] "PC1"
## [1] "PPP1R14B" "DHCR24" "PIF1" "NR6A1" "HMGCS1"
## [1] ""
## [1] "EIF5A" "ATP5G3" "NDUFA4" "FABP5" "MYL6"
## [1] ""
## [1] ""
## [1] "PC2"
## [1] "HSP90AA1" "SET" "KPNA2" "HNRNPDL" "HSPA8"
## [1] ""
## [1] "PFN1" "HIST1H4C" "COX7C" "RP11-148B6.1"
## [5] "UBL5"
## [1] ""
## [1] ""
## [1] "PC3"
## [1] "HNRNPH1" "CDC42" "SOX11" "CTNNB1" "CTTN"
## [1] ""
## [1] "NPM1" "ACTG1" "HSPD1" "XRCC6" "FDPS"
## [1] ""
## [1] ""
## [1] "PC4"
## [1] "FZD7" "LINC00458" "CTC-286N12.1" "MT1H"
## [5] "RP11-359M6.1"
## [1] ""
## [1] "MDK" "UTF1" "PDGFA" "ACTG1" "IRX2"
## [1] ""
## [1] ""
## [1] "PC5"
## [1] "CTA-29F11.1" "HSPA8" "CCND1" "GABPB1-AS1" "ILF3-AS1"
## [1] ""
## [1] "NPM1" "HIST1H4C" "ACTB" "HNRNPH1" "PIF1"
## [1] ""
## [1] ""
To overcome the extensive technical noise in any single gene for scRNA-seq data, Seurat clusters cells based on their PCA scores, with each PC essentially representing a ‘metagene’ that combines information across a correlated gene set. Determining how many PCs to include downstream is therefore an important step. In our dataset, the elbow appears to fall around PC 20, so the first 20 PCs are selected for cell clustering.
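The elbow is normally judged by eye from a plot of per-PC standard deviation. As a rough illustration of what the eye is doing, the sketch below reports the PC reached by the last sizeable drop in standard deviation. The heuristic, the function name, the `min_drop` parameter, and the toy values are all hypothetical; this is not how the cutoff of 20 was chosen here, nor a Seurat function.

```python
def elbow_index(std_devs, min_drop=0.05):
    """Return the 1-based index of the PC at which the curve flattens.

    A drop between consecutive PCs counts as sizeable when it is at least
    min_drop times the first PC's standard deviation. The elbow is the PC
    reached by the last sizeable drop (PC 1 if there is none).
    """
    threshold = min_drop * std_devs[0]
    elbow = 1
    for i in range(len(std_devs) - 1):
        if std_devs[i] - std_devs[i + 1] >= threshold:
            elbow = i + 2  # 1-based index of the PC after the drop
    return elbow


# Hypothetical standard deviations for seven PCs.
stds = [10.0, 7.0, 5.0, 4.0, 3.9, 3.85, 3.8]
print(elbow_index(stds))
```

A heuristic like this is sensitive to the chosen threshold, which is one reason a visual check of the plot remains the usual practice.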
Seurat now includes a graph-based clustering approach. Briefly, this method embeds cells in a graph structure, for example a K-nearest neighbor (KNN) graph, with edges drawn between cells with similar gene expression patterns, and then attempts to partition this graph into highly interconnected ‘communities’.
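The graph-construction half of this approach can be sketched as follows: each cell is linked to its k closest cells in PC space. The sketch assumes Euclidean distance on PC scores and brute-force search; the function name and the toy coordinates are hypothetical, and the community-detection step that partitions the graph is omitted.

```python
import math


def knn_graph(points, k):
    """Map each cell index to the indices of its k nearest cells (Euclidean)."""
    graph = {}
    for i, p in enumerate(points):
        # Sort all other cells by distance, breaking ties by index.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        graph[i] = [j for _, j in dists[:k]]
    return graph


# Hypothetical PC scores for six cells forming two well-separated groups.
pcs = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.2),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.3),
]
graph = knn_graph(pcs, k=2)
print(graph)
```

In this toy example every edge stays inside one of the two groups, so a partitioning step would recover them as separate communities.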
Seurat continues to use tSNE as a powerful tool to visualize clustering, where tSNE aims to place cells with similar local neighborhoods in high-dimensional space together in low-dimensional space.
Seurat includes several tools for visualizing marker expression. VlnPlot (shows expression probability distributions across clusters) and FeaturePlot (visualizes gene expression on a tSNE or PCA plot) are the most commonly used visualizations.